
Support WDL call caching #5105

Merged: adamnovak merged 48 commits into master from issues/4797-shared-fs-caching on Oct 28, 2024

Conversation

adamnovak (Member)

This adds support for WDL call caching compatible with MiniWDL, controlled by the same configuration/environment variables that MiniWDL uses. This should fix #4797.

It caches only task calls, plus the results of the write_* WDL functions, which is necessary for caching calls that depend on workflow-generated files.

When saving to the cache, it copies all output files from a Toil step into the MiniWDL cache folder, since Toil doesn't usually generate a persistent copy of task output files.

It does not cache downloads the way MiniWDL can, so tasks that depend on URL-based files probably can't be cached (or might get stuck cached and not update when the URL content changes).

It doesn't do anything special for String-to-File coercion, but probably should.

If you have this workflow:

version 1.1

workflow TestWorkflow {
    input {
    }
    call TestTask {
        input:
    }
    call WCTask as count_listing {
        input:
            to_count = TestTask.dirlist
    }
    call WCTask as count_dynamic {
        input:
            to_count = TestTask.trouble
    }
    call WCTask as count_workflow_scope {
        input:
            to_count = write_lines(["At", "workflow", "scope"])
    }
    output {
        Array[File] counts = [count_listing.result, count_dynamic.result, count_workflow_scope.result]
    }
}

task TestTask {
    input {
    }
    command <<<
        ls -l
    >>>
    output {
        Int number = 1
        File dirlist = stdout()
        File trouble = write_lines(["Hello", "World"])
    }
    runtime {
        container: "ubuntu:latest"
        cpu: 1
    }
}

task WCTask {
    input {
        File to_count
    }
    command <<<
        wc -l ~{to_count}
    >>>
    output {
        File result = stdout()
    }
    runtime {
        container: "ubuntu:latest"
        cpu: 1
    }
}

Then you can run it with either of:

MINIWDL__CALL_CACHE__GET=true MINIWDL__CALL_CACHE__PUT=true MINIWDL__CALL_CACHE__DIR=`pwd`/miniwdl-cache miniwdl run test.wdl

Or

MINIWDL__CALL_CACHE__GET=true MINIWDL__CALL_CACHE__PUT=true MINIWDL__CALL_CACHE__DIR=`pwd`/miniwdl-cache toil-wdl-runner --logDebug --jobStore ./tree test.wdl

And the second one will get cache hits for all the tasks that the first one ran.

Changelog Entry

To be copied to the draft changelog by merger:

  • Toil now supports reading and writing MiniWDL's call cache.

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@adamnovak (Member Author)

I need to make this also support caching calls to whole workflows. If I use the files in the output tasks' caches instead of fetching the cached results of one workflow, I get different paths (that symlink to the same files), and then I can't re-use MiniWDL cached calls that depend on files from the called workflow.

@adamnovak (Member Author)

I think we might run into problems with sibling files with WDL call caching when reading caches written by MiniWDL. MiniWDL lays out task outputs by output name, and then references those locations in the cache files. So files that were siblings won't be siblings anymore when loaded from MiniWDL's task output directories.

I think we might just have to let that not work. When Toil writes files into the cache, it should obey the sibling-files constraint, because it uses the same path-picking code as when it writes files for a task.

@adamnovak (Member Author)

https://ucsc-ci.com/databiosphere/toil/-/jobs/79122 got Cannot assign requested address errors trying to do an HTTP download in src/toil/test/wdl/wdltoil_test.py::WDLTests::test_miniwdl_self_test_by_reference. I don't think it has anything to do with that test specifically, except that it needs to do networking. We might have a problem with network connectivity on the GitLab runners, in addition to the problem that led us to assign a static IP to the main GitLab server.

@stxue1 mentioned this pull request on Oct 1, 2024
@DailyDreaming (Member) left a comment

This looks alright to me. Is there a way a test could be added?

@@ -267,7 +268,7 @@ def _runStep(self):
 if self.checkOnJobs():
     activity = True
 if not activity:
-    logger.debug('No activity, sleeping for %is', self.boss.sleepSeconds())
+    logger.log(TRACE, 'No activity, sleeping for %is', self.boss.sleepSeconds())
Member:

Should this go back to debug?

Member Author:

No, I lowered this to TRACE since I didn't think it made sense to log every second at debug level.
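For context, a custom level below DEBUG like the TRACE used in the diff can be registered with Python's standard logging module. This is only a sketch of the general technique; the numeric value chosen here is an assumption for illustration, not necessarily the one Toil defines.

```python
import logging

# Register a TRACE level below DEBUG (10). The value 5 is an assumption
# for illustration; any positive value under DEBUG works.
TRACE = logging.DEBUG - 5
logging.addLevelName(TRACE, "TRACE")

logger = logging.getLogger("example")
# At the default effective level (WARNING), this per-second message is
# suppressed; it only appears when the logger is set to TRACE or lower.
logger.log(TRACE, "No activity, sleeping for %is", 1)
```

Keeping the polling message at a level below DEBUG means `--logDebug` runs stay readable while the message remains available for deep debugging.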

@@ -49,14 +49,14 @@ def print_dot_chart(self) -> None:

 # Make job IDs to node names map
 jobsToNodeNames: Dict[str, str] = dict(
-    map(lambda job: (str(job.jobStoreID), job.jobName), self.jobsToReport)
+    map(lambda job: (str(job.jobStoreID), str(job.jobStoreID).replace("_", "___").replace("/", "_").replace("-", "__")), self.jobsToReport)
Member:

Hmmm... name_/something seems equivalent to name/_something after conversion. Is that alright?

Member Author:

Oh good point, I didn't think about adjacent sequences.
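The collision the reviewer spotted can be demonstrated directly. This reproduces the substitution chain from the diff above and shows two distinct IDs escaping to the same node name:

```python
def escape_node_name(name: str) -> str:
    # The substitution chain from the diff above, applied left to right:
    # "_" -> "___", then "/" -> "_", then "-" -> "__".
    return name.replace("_", "___").replace("/", "_").replace("-", "__")

# Adjacent separator sequences collide: "_/" becomes "___" + "_" and
# "/_" becomes "_" + "___", so both inputs map to four underscores.
a = escape_node_name("name_/something")  # "name____something"
b = escape_node_name("name/_something")  # "name____something"
assert a == b  # two different job store IDs, one DOT node name
```

A common fix for this class of bug is to escape the escape character itself first with a terminated code (e.g. `_` to `_u_` and `/` to `_s_`), so substitutions can't run together ambiguously.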

    imported = file_dest.import_file(candidate_uri, check_existence=False)
else:
    # The file store import_file doesn't do an existence check.
    # TODO: Have a more compatible interface.
Member:

In what way?

@adamnovak (Member Author)

@stxue1 Can you maybe look at this? I tried to merge my code that tacks a shared filesystem path onto every File that has one with your code that defers virtualization to task boundaries, but I know I'm missing at least one route that files can come in, because I don't think I am managing to add the shared filesystem path when a workflow-level string is coerced to a file and then virtualized. Where is that happening, and at that point do we have a good way to distinguish a leader-filesystem file from a worker-filesystem file?

Also, I went through and turned all the setattr file mutations and file.value = mutations into immutable replacements of the file with the version that has the attribute set/has its value set, because I was not confident that I understood the implications of changing the original File object. Is this approach going to work?

@stxue1 (Contributor)

stxue1 commented Oct 8, 2024

> @stxue1 Can you maybe look at this? I tried to merge my code that tacks a shared filesystem path onto every File that has one with your code that defers virtualization to task boundaries, but I know I'm missing at least one route that files can come in, because I don't think I am managing to add the shared filesystem path when a workflow-level string is coerced to a file and then virtualized. Where is that happening, and at that point do we have a good way to distinguish a leader-filesystem file from a worker-filesystem file?

The virtualization of a workflow-level string coerced to a file takes place in two areas. On task boundaries, virtualization is handled when virtualize_files() is called manually. The other place is when _read/_write is called, which happens when a string value is used implicitly in a standard library function that accepts files. Both ultimately point back to the _virtualize_file function, so I believe for optimal coverage caching should go right before this function is called.

I think the two places that matter are the virtualize_files call at the task boundary in WDLTaskWrapperJob (I think a file can be coerced from a string here, but I will have to double check), and the _read and _write calls in the standard library. I think it would be the _read call in ToilWDLStdLibBase and the _write call in ToilWDLStdLibTaskCommand, though my immediate hunch is that the TaskCommand stdlib shouldn't be touched, as that is more for MiniWDL internally (I left a docstring explaining why they differ; if it isn't clear I can come up with a better docstring). It seems like the new ToilWDLStdLibWorkflow._write does properly handle the virtualization and adds the caching system along with it. I think the place to check whether caching is needed/supported is ToilWDLStdLibBase, which still seems to be used in certain areas.
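A minimal sketch of why a single choke point covers both routes described above (all names here are hypothetical, not Toil's actual internals): the task-boundary virtualize_files() pass and the implicit _read/_write calls both funnel into one function, so a cache lookup placed there sees every file.

```python
from typing import Callable, Dict

# Hypothetical memo table: local path -> virtualized URI. In this sketch
# it stands in for whatever bookkeeping a _virtualize_file-style function
# would consult before uploading.
_VIRTUALIZED: Dict[str, str] = {}

def virtualize_file(path: str, upload: Callable[[str], str]) -> str:
    """Upload a file at most once, no matter which route requested it."""
    if path not in _VIRTUALIZED:
        # Cache miss: upload (e.g. to the job store) and remember the URI.
        _VIRTUALIZED[path] = upload(path)
    # Cache hit: reuse the existing virtualized reference.
    return _VIRTUALIZED[path]
```

Because both entry points call the same function, hooking caching here avoids having to instrument each call site separately.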

> Also, I went through and turned all the setattr file mutations and file.value = mutations into immutable replacements of the file with the version that has the attribute set/has its value set, because I was not confident that I understood the implications of changing the original File object. Is this approach going to work?

The immutable replacements should be fine; the metadata they copy over includes the necessary information for virtualization.

@stxue1 (Contributor)

stxue1 commented Oct 8, 2024

> at that point do we have a good way to distinguish a leader-filesystem file from a worker-filesystem file?

I suppose there isn't currently a very good way to distinguish whether file.value is pointing to a worker-filesystem file or a leader-filesystem file. Any use of a file will automatically try to read from the virtualized value, creating it if it doesn't exist (and thus uploading it to the job store), then changing the file.value field to the corresponding devirtualized path. This behavior is left over from before, when we read virtualized values that were set in the file.value field. Perhaps we should change it so the file.value field always points to a path that exists on the associated worker/leader.

My main concern is figuring out where exactly to do this without breaking path substitution support. In theory, I think at the beginning of every job should be fine.
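The suggestion above, normalizing each File's value at the start of every job so it always names a path that exists on the current node, could be sketched like this (function and parameter names are hypothetical, not Toil's API):

```python
import os
from typing import Callable

def normalize_file_value(value: str, devirtualize: Callable[[str], str]) -> str:
    """Hypothetical per-job normalization of a File's value field.

    If the value already resolves on this node's filesystem, use it
    directly; otherwise devirtualize it (fetch from the job store) so the
    value always points at a locally readable path.
    """
    if os.path.exists(value):
        return value              # already usable on this node
    return devirtualize(value)    # fetch to a local path on this node
```

Running this at the start of each job would make the leader/worker distinction moot for downstream code, at the cost of a stat call per file, though path substitution support would still need care, as noted above.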

)

# Apply the shared filesystem path to the virtualized file
set_shared_fs_path(virtualized_file, exported_path)
Contributor:

Missing an assignment?

@adamnovak (Member Author)

Steven's File Facts

The original string value is usually stored as the File value, unless the File is virtualized and sent to a task command job and the task tries to use it; in that case, inside the task, it will have a value pointing to a copy from the file store. If the file is passed through the task at the WDL level, it has the original string in its value when it gets back to workflow scope.

In the task wrapper the original string path is used.

JSON-based inputs get stored as file:/usr/lib/whatever but string coercions get stored as /usr/lib/whatever.

Strings coerced inside task output sections appear as toilfile: URIs at workflow scope.
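The three observed forms of a File's value can be told apart by URI scheme. This classifier is purely illustrative (it is not Toil code; the labels are this sketch's own), but it shows that the distinction reduces to scheme parsing:

```python
from urllib.parse import urlsplit

def classify_file_value(value: str) -> str:
    """Hypothetical classifier for the three forms noted above."""
    scheme = urlsplit(value).scheme
    if scheme == "toilfile":
        # e.g. strings coerced inside task output sections
        return "virtualized file store reference"
    if scheme == "file":
        # e.g. file:/usr/lib/whatever from JSON-based inputs
        return "JSON-based input (file: URI)"
    if scheme == "":
        # e.g. /usr/lib/whatever from string coercions
        return "string coercion (bare path)"
    return "other URI scheme: " + scheme
```

Whether a bare path refers to the leader's or a worker's filesystem is exactly the ambiguity discussed earlier in this thread; the scheme alone can't resolve that.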

@adamnovak merged commit 59d5c86 into master on Oct 28, 2024
1 check passed
@adamnovak deleted the issues/4797-shared-fs-caching branch on October 28, 2024 at 20:29